Abstract

We describe a number of experiments that demonstrate the usefulness of prosodic information for a processing module which parses spoken utterances with a feature-based grammar employing empty categories. We show that a derivation can be accomplished more efficiently by requiring certain prosodic properties of those positions in the input where the presence of an empty category has to be hypothesized. The approach has been implemented in the machine translation project Verbmobil and results in a significant reduction of the workload for the parser.¹

1 Introduction 

In this paper we describe how syntactic and 
prosodic information interact in a translation 
module for spoken utterances which tries to meet 
the two - often conflicting - main objectives, the 
implementation of theoretically sound solutions 
and efficient processing of the solutions. 

As an analysis which meets the first criterion 
but seemingly fails to meet the second one, we take 
an analysis of the German clause which relies on 
traces in verbal head positions in the framework of 
Head-driven Phrase Structure Grammar (HPSG, 
cf. (Pollard&Sag, 1994)). 

The methods described in this paper have been implemented as part of the IBM-SynSem-Module and the FAU-Erlangen/LMU-Munich-Prosody-Module in the MT project VERBMOBIL (cf. (Wahlster, 1993)) where spontaneously spoken utterances in a negotiation dialogue are translated. In this system, an HPSG is processed by a bottom-up chart parser that takes word lattices as its input. The output of the parser is the semantic representation for the best string hypothesis in the lattice.

¹ [...] German Federal Ministry for Research and Technology under Grant #01 IV 101 V (Verbmobil). The responsibility for the contents of this study lies with the authors.

It is our main result that prosodic informa- 
tion can be employed in such a system to de- 
termine possible locations for empty elements in 
the input. Rather than treating prosodic informa- 
tion as virtual input items which have to match 
an appropriate category in the grammar rules 
(Bear&Price, 1990), or which by virtue of being 
'unknown' in the grammar force the parser to close 
off the current phrase (Marcus&Hindle, 1990), our 
parser employs prosodic information as affecting 
the postulation of empty elements. 

2 An HPSG Analysis of German Clause Structure

HPSG makes crucial use of "head traces" to analyze the verb-second (V2) phenomenon pertinent in German, i.e. the fact that finite verbs appear in second position in main clauses but in final position in subordinate clauses, as exemplified in (1a) and (1b).

1. (a) Gestern reparierte er den Wagen.
       (Yesterday fixed he the car)
       'Yesterday, he fixed the car.'

   (b) Ich dachte, daß er gestern den Wagen reparierte.
       (I thought that he yesterday the car fixed)
       'I thought that he fixed the car yesterday.'

Following (Kiss&Wesche, 1991) we assume that the structural relationship between the verb and its arguments and modifiers is not affected by the position of the verb. The overt relationship between the verb 'reparierte' and its object 'den Wagen' in (1b) is preserved in (1a), although the verb shows up in a different position. The apparent contradiction is resolved by assuming an empty element which serves as a substitute for the verb in second position. The empty element fills the position occupied by the finite verb in subordinate clauses, leading to the structure of main clauses exemplified in (2).



[Tree diagram not recoverable from the extraction; surviving node labels: Gestern, X0_i, Wagen]

(2): Syntax tree for 'Gestern reparierte er den Wagen.'

The empty verbal head in (2) carries syntactic and semantic information. In particular, the empty head licenses the realization of the syntactic arguments of the verb according to the rule schemata of German and HPSG's Subcategorization Principle.

The structure of the main clause presented in 
(2) can be justified on several grounds. In partic- 
ular, the parallelism in verbal scope between verb 
final and V2 clauses - exemplified in (3a) and (3b) 
- can be modeled best by assuming that the scope 
of a verb is always determined w.r.t. the final po- 
sition. 

3. (a) Ich glaube, du sollst nicht töten.
       (I believe you shall not kill)
       'I believe you should not kill.'

   (b) Ich glaube, daß du nicht töten sollst.
       (I believe that you not kill shall)
       'I believe that you should not kill.'

In a V2 clause, the scope of the verb is deter- 
mined with respect to the empty verbal head only. 
Since the structural position of an empty verbal 
head is identical to the structural position of an 
overt finite verb in a verb final clause, the invari- 
ance does not come as a surprise. 

Rather than exploring alternative approaches here, we will briefly touch upon the representation of the dependency in terms of HPSG's featural architecture. Information pertaining to empty heads is projected along the DOUBLE SLASH (DSL) feature instead of the SLASH feature (cf. (Borsley, 1989)). The empty head is described in (4), where the LOCAL value is coindexed with the DSL value.



    [ PHON    elist
      SYNSEM  [ LOC     [1]
                NONLOC  [ DSL { [1] } ] ] ]

(4): Feature description of a head trace

The DSL of a head is identical to the DSL of the mother, i.e. DSL does not behave like a NONLOCAL but like a HEAD feature.



A DSL dependency is bound if the verbal projection is selected by a verb in second position. A lexical rule guarantees that the selector shares all relevant information with the DSL value of the selected verbal projection. The relationship between a verb in final position, a verb in second position and the empty head can be summarized as follows: For each final finite verb form, there is a corresponding finite verb form in second position which licenses a verbal projection whose empty head shares its LOCAL information with the corresponding final verb form. It is thus guaranteed that the syntactic arguments of the empty head are identical to the syntactic arguments required by the selecting verb.

3 Processing Empty Elements 

Direct parsing of empty elements can become a 
tedious task, decreasing the efficiency of a system 
considerably. 

Note first, that a reduction of empty elements 
in a grammar in favor of disjunctive lexical rep- 
resentations, as suggested in (Pollard&Sag, 1994, 
ch.9), cannot be pursued. 

(Pollard&Sag, 1994) assume that an argument may occur on the SUBCAT or on the SLASH list. A lexical operation removes the argument from SUBCAT and puts it onto SLASH. Hence, no further need for a syntactic representation of empty elements emerges. This strategy, however, will not work for head traces because they do not occur as dependents on a SUBCAT list.

If empty elements have to be represented syn- 
tactically, a top-down parsing strategy seems bet- 
ter suited than a bottom-up strategy. Particu- 
larly, a parser driven by a bottom-up strategy has 
to hypothesize the presence of empty elements at 
every point in the input. 

In HPSG, however, only very few constraints are available for a top-down regime since most information is contained in lexical items. The parser will not restrict the stipulation of empty elements until a lexical element containing restrictive information has been processed. The apparent advantage of top-down parsing is thus lost when HPSGs are to be parsed. The same criticism applies to other parsing strategies with a strong top-down orientation, such as left corner parsing or head corner parsing.

We have thus chosen a bottom-up parsing strategy where the introduction of empty verbal heads is constrained by syntactic and prosodic information. The syntactic constraints build on the facts that a) a verb trace will always occur to the right of its licenser and b) always 'lower' in the syntax tree. Furthermore, c) since the DSL percolation mechanism ensures structure sharing between the verb and its trace, a verb trace always comes with a corresponding overt verb.

As a consequence of c) the parser has a fully specified verb form - although with empty phonology - at hand, rather than having to cope with the underspecified structure in (4). This form can be determined at compile time and stored in the lexicon together with the corresponding verb form. It is pushed onto the trace stack whenever this verb is accessed.
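
As a minimal sketch of conditions a) and c), the following Python fragment pushes a precompiled trace onto a trace stack when an overt verb is accessed and proposes that trace only at positions to the right of the verb. The lexicon contents, the function name `trace_hypotheses`, and the string encoding of traces are illustrative assumptions, not part of the implemented system; condition b) and all feature-structure unification are omitted.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon: maps an overt V2 verb form to a label standing in
# for its precompiled trace (fully specified except for empty phonology).
TRACE_LEXICON: Dict[str, str] = {"reparierte": "trace(reparierte)"}


def trace_hypotheses(words: List[str],
                     lexicon: Dict[str, str]) -> List[Tuple[int, str]]:
    """Return (boundary, trace) pairs; boundary k means 'after words[k-1]'."""
    trace_stack: List[str] = []
    hypotheses: List[Tuple[int, str]] = []
    for i, word in enumerate(words):
        if word in lexicon:
            # Condition c): accessing the overt verb pushes its trace.
            trace_stack.append(lexicon[word])
        # Condition a): a trace is only proposed to the right of its
        # licenser, i.e. at boundaries reached after the verb was seen.
        for trace in trace_stack:
            hypotheses.append((i + 1, trace))
    return hypotheses


hypotheses = trace_hypotheses(
    ["Gestern", "reparierte", "er", "den", "Wagen"], TRACE_LEXICON)
```

For 'Gestern reparierte er den Wagen' the sketch proposes a trace after every word from 'reparierte' onward, which illustrates why condition a) on its own leaves many candidate positions in a V2 clause.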

Although a large number of bottom-up hy- 
potheses regarding the position of an empty el- 
ement can be eliminated by providing the parser 
with the aforementioned information, the number 
of wrong hypotheses is still significant. 

In a verb-2nd clause most of the input follows 
a finite verb form so that condition a) indeed is 
not very restrictive. Condition b) rules out a large 
number of structures but often cannot prevent the 
stipulation of traces in illicit positions. Condition 
c) has the most restrictive effect in that the syn- 
tactic potential of the trace is determined by that 
of the corresponding verb. 

If the number of possible trace locations could 
be reduced significantly, the parser could avoid a 
large number of subanalyses that conditions a)-c) 
would rule out only at later stages of the deriva- 
tion. The strategy that will be advocated in the 
remainder of this paper employs prosodic infor- 
mation to accomplish this reduction. 

Empty verbal heads can only occur in the right 
periphery of a phrase, i.e. at a phrase bound- 
ary. The introduction of empty arcs is then not 
only conditioned by the syntactic constraints men- 
tioned before, but additionally, by certain require- 
ments on the prosodic structure of the input. 

It turns out, then, that a fine-grained prosodic classification of utterance turns, based on correlations between syntactic and prosodic structure, is of use not only to determine the segmentation of a turn but also to predict which positions are eligible for trace stipulation. The following section focuses on the prosodic classification schema; section 5 presents the results of the current experiments.

4 Classifying Prosodic Information 

The standard unit of spoken language in a dialogue is the turn. A turn like (5) can be composed of several sentences and subsentential phrases - free elements like the phrase 'Im April' which do not stand in an obvious syntactic relationship with the surrounding material and which occur much more often in spontaneous speech than in other environments. One of the major tasks of a prosodic component of a processing system is the determination of phrase boundaries between these sentences and free phrases.

5. Im April. Anfang April bin ich in Urlaub. Ende April habe ich noch Zeit.
   (In April beginning April am I on vacation end April have I still time)
   'In April. I am on vacation at the beginning of April. I still have time at the end of April.'

In written language, phrase boundaries are often determined by punctuation, which is, of course, not available in spoken discourse. For the recognition of these phrase boundaries, we use a statistical approach in which acoustic-prosodic features computed from the speech signal are classified.

The classification experiments for this pa- 
per were conducted on a set of 21 human- 
human dialogs, which are prosodically labelled (cf. 
(Reyelt, 1995)). We chose 18 dialogs (492 turns, 
36 different speakers, 6996 words) for training, 
and 3 dialogs for testing (80 turns, 4 different 
speakers, 1049 words). 

The computation of the acoustic-prosodic fea- 
tures is based on a time alignment of the phoneme 
sequence corresponding to the spoken or recog- 
nized words. To exclude word recognition errors, 
for this paper we only used the spoken word se- 
quence thus simulating 100% word recognition. 
The time alignment is done by a standard hid- 
den Markov model word recognizer. For each syl- 
lable to be classified the following prosodic fea- 
tures were computed fully automatically from the 
speech signal for the syllable under consideration 
and for the six syllables in the left and the right 
context: 

• the normalized duration of the syllable nucleus

• the minimum, maximum, onset, and offset of fundamental frequency (F0) and the maximum energy and their positions on the time axis relative to the position of the actual syllable

• the mean energy, and the mean F0

• flags indicating whether the syllable carries 
the lexical word accent or whether it is in a 
word final position 

The following features were computed only for 
the syllable under consideration: 

• the length of the pause (if any) preceding or 
succeeding the word containing the syllable 

• the linear regression coefficients of the F0 contour and the energy contour computed over 15 different windows to the left and to the right of the syllable

This amounts to a set of 242 features, which so far achieved best results on a large database of read speech; for a more detailed account of the feature evaluation, cf. (Kießling, 1996).

The full set of features could not be used due to the lack of sufficient training data. Best results were achieved with a subset of features, containing mostly durational features and F0 regression coefficients. A first set of reference labels was based on perceptive evaluation of prosodically marked boundaries by non-naive listeners (cf. (Reyelt, 1995)). Here, we will only deal with major prosodic phrase boundaries (B3), which correspond closely to the intonational phrase boundaries in the ToBI approach (cf. (Beckman&Ayers, 1994)), vs. all other boundaries (no boundary, minor prosodic boundary, irregular boundary). Still, a purely perceptual labelling of the phrase boundaries under consideration seems problematic. In particular, we find phrase boundaries which are classified according to the perceptual labelling although they do not correspond to a syntactic phrase boundary. Illustrations are given below, where perceptually labelled but syntactically unmotivated boundaries are denoted with a vertical bar.

6. (a) Sollen wir uns dann im Monat März | einmal treffen?
       (Shall we us then in month March meet)
       'Should we meet then in March?'

   (b) Wir treffen uns am Dienstag | den dreizehnten April.
       (We meet us on Tuesday the thirteenth April.)
       'We meet on Tuesday the thirteenth of April.'

Guided by the assumption that only the bound- 
ary of the final intonational phrase is relevant for 
the present purposes, we argue for a categorial 
labelling (cf. (Feldhaus&Kiss, 1995)), i.e. a la- 
belling which is solely based on linguistic defini- 
tions of possible phrase boundaries in German. 

Thus instead of labelling a variety of prosodic 
phenomena which may be interpreted as bound- 
aries, the labelling follows systematically the syn- 
tactic phrasing, assuming that the prosodic real- 
ization of syntactic boundaries exhibits properties 
that can be learned by a prosodic classification al- 
gorithm. 

The 21 dialogues described above were labelled 
according to this scheme. For the classification 
reported in the following, we employ three main 
labels, S3+ (syntactic boundary obligatory), S3- 
(syntactic boundary impossible), and S3? (syn- 
tactic boundary optional). Table 1 shows the cor- 
respondence between the S3 and B3 labels (not 
taking turn-final labels into account). 



        cases    B3   not-B3
S3+       844    82       18
S3-      5907     3       97
S3?       570    32       68

Table 1: Correspondence between S3 and B3 labels in %.

Multi-layer perceptrons (MLP) were trained to recognize S3+ labels based on the features and data as described above. The MLP has one output node for S3+ and one for S3-. During training the desired output for each of the feature vectors is set to one for the node corresponding to the reference label; the other one is set to zero. With this method the MLP in theory estimates a posteriori probabilities for the classes under consideration. However, in order to balance for the a priori probabilities of the different classes, during training the MLP was presented with an equal number of feature vectors from each class. For the experiments, MLPs with 40/20 nodes in the first/second hidden layer showed best results.
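
The target coding and class balancing described above can be sketched as follows. This is an illustration under stated assumptions, not the training code used in the experiments: the text only says that equally many feature vectors per class were presented, so balancing by downsampling to the rarest class is an assumption, and the names `TARGETS` and `balanced_training_set` are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Target = Tuple[float, float]

# One output node per class: 1.0 for the node of the reference label,
# 0.0 for the other node.
TARGETS: Dict[str, Target] = {"S3+": (1.0, 0.0), "S3-": (0.0, 1.0)}


def balanced_training_set(vectors: Sequence[Vector],
                          labels: Sequence[str],
                          seed: int = 0) -> List[Tuple[Vector, Target]]:
    """Pair feature vectors with targets, with equally many per class."""
    by_class: Dict[str, List[Vector]] = defaultdict(list)
    for vector, label in zip(vectors, labels):
        by_class[label].append(vector)
    # Assumption: balance by downsampling to the size of the rarest class.
    per_class = min(len(items) for items in by_class.values())
    rng = random.Random(seed)
    sample = [(vector, TARGETS[label])
              for label, items in by_class.items()
              for vector in rng.sample(items, per_class)]
    rng.shuffle(sample)
    return sample
```

Presenting the classes in equal proportion keeps the far more frequent S3- positions from dominating the training signal.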

For both S3 and B3 labels we obtained overall 
recognition rates of over 80% (cf. table 2). 

Note that due to limited training data, errors in F0 computation and variabilities in the acoustic marking of prosodic events across speakers, dialects, and so on, one cannot expect an error-free detection of these boundaries.

Table 2 shows the recognition results in percent 
for the S3+/S3- classifier and for the B3/not-B3 
classifier using the S3-positions as reference (first 
column) again not counting turn final boundaries. 

For example, in the first row the number 24 
means that 24% of the S3+ labels were classified 
as S3-, the number 75 means that 75% of the S3+ 
labels were classified as B3. 



        cases   S3+   S3-    B3   not-B3
S3+       110    76    24    75       25
S3-       766    14    86    14       86
S3?        93    43    57    46       54

Table 2: Recognition rates for S3 labels in % for S3 and B3 classifiers.



What table 2 shows, then, is that syntactic S3 boundaries can be classified using only prosodic information, yielding recognition rates comparable to those for the recognition of perceptually identified B3 boundaries. For our purposes this means that we do not need to label boundaries perceptually, but can instead employ an approach such as the one advocated in (Feldhaus&Kiss, 1995), using only the transliterated data. While this system turned out to be very time-consuming when applied to larger quantities of data, (Batliner et al., 1996) report promising results from applying a similar but less labor-intensive system.

It has further to be considered that the recognition rate for perceptual labelling includes those cases where phrase boundaries have been recognized in positions which are impossible on syntactic grounds (cf. the number of cases in table (1) where an S3- position was classified as B3 and vice versa).

It is important to note that this approach does not take syntactic boundaries and phonological boundaries to be one and the same thing. It is a well-known fact that these two phenomena often are orthogonal to each other. However, the question to be answered was: can we devise an automatic procedure to identify the syntactic boundaries with (at least) about the same reliability as the prosodic ones? As the figures in table (2) demonstrate, the answer to this question is yes.

Our overall recognition rate of 84.5% for the S3-classifier (cf. table (2)) cannot be compared directly with results reported in other studies because these studies were either based on read and carefully designed material (cf., e.g., (Bear&Price, 1990), (Ostendorf&Veilleux, 1994)), or they used textual and perceptual information rather than automatically computed acoustic-prosodic features (cf. (Wang&Hirschberg, 1992)).

5 Results 

In order to approximate the usefulness of prosodic 
information to reduce the number of verb trace 
hypotheses for the parser we examined a corpus 
of 104 utterances with prosodic annotations de- 
noting the probability of a syntactic boundary af- 
ter every given word. For every node whose S3 
boundary probability exceeds a certain threshold 
value, we considered the hypothesis that this node 
is followed by a verb trace. These hypotheses were 
then rated valid or invalid by the grammar writer. 

Note that such a setting where a position in the 
input is annotated with scores representing the re- 
spective boundary probabilities is much more ro- 
bust w.r.t unclear classification results than a pure 
binary 'boundary-vs.-nonboundary' distinction. 
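
The threshold test on scored positions can be sketched as below. The function name and the example probabilities are invented for illustration; the default of 0.01 is the threshold value used in the evaluation reported in this section.

```python
from typing import List, Sequence


def propose_trace_positions(boundary_probs: Sequence[float],
                            threshold: float = 0.01) -> List[int]:
    """Indices k such that a verb trace is hypothesized after word k.

    boundary_probs[k] is the S3 boundary probability after word k.
    """
    return [k for k, prob in enumerate(boundary_probs) if prob > threshold]


# Hypothetical S3 probabilities for the five words of
# 'Gestern reparierte er den Wagen' (values invented for illustration).
example_probs = [0.002, 0.004, 0.03, 0.001, 0.95]
proposed = propose_trace_positions(example_probs)
```

Because each position keeps its score, the same annotation can be reused with a different threshold without re-running the classifier, which a hard binary decision would not allow.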

The observations were rated according to the following scheme²:

               X0 position     no X0 position
X0 prop.       Correct: 138    False Alarm: 274
no X0 prop.    Miss: 6         X: 703

Table 3: Classification results for verb trace positions

Evaluation of these figures for our test corpus and a threshold value of 0.01 yielded the following result:

Recall     = 95.8 %
Precision  = 33.5 %
Error      = 25.0 %

Table 4: Recall, Precision and Error for the identification of possible verb trace positions.



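where:

\[ \mathrm{Recall} = \frac{\mathit{Correct}}{\mathit{Correct} + \mathit{Miss}} \]

\[ \mathrm{Precision} = \frac{\mathit{Correct}}{\mathit{Correct} + \mathit{False}} \]

\[ \mathrm{Error} = \frac{\mathit{Miss} + \mathit{False}}{\mathit{Correct} + \mathit{False} + \mathit{Miss} + \mathit{X}} \]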
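
As a check on the arithmetic, the three measures can be recomputed from the counts in Table 3. The sketch below is illustrative only; the names `TraceScores` and `trace_scores` are hypothetical, and "False" in the formulas is read as the False Alarm count and "X" as the count of positions that were neither proposed nor trace positions.

```python
from typing import NamedTuple


class TraceScores(NamedTuple):
    recall: float
    precision: float
    error: float


def trace_scores(correct: int, false_alarm: int,
                 miss: int, rejected: int) -> TraceScores:
    """Recall, precision and error as defined for Table 4."""
    recall = correct / (correct + miss)
    precision = correct / (correct + false_alarm)
    error = (miss + false_alarm) / (correct + false_alarm + miss + rejected)
    return TraceScores(recall, precision, error)


# Counts taken from Table 3.
scores = trace_scores(correct=138, false_alarm=274, miss=6, rejected=703)
```

Rounded to one decimal place in percent, these values agree with the figures in Table 4.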

In practice this means that the number of locations where the parser has to assume the presence of a verb trace could be reduced from 1121 to 412 while only 6 necessary trace positions remained unmarked. These results were obtained from a corpus of spoken utterances many of which contained several independent phrases and sentences. These segments, however, are also often separated by an S3-boundary, so that the error rate is likely to drop considerably if a segmentation of utterances into syntactically well-formed phrases is performed prior to the trace detection. Since cases where the verb trace is not located at the end of a sentence (i.e. where extraposition takes place) involve a highly characteristic categorial context, we expect a further improvement if the trace/no-trace classification based on prosodic information is combined with a language model.

The problem with the approach described above 
is that a careful estimation of the threshold value 
is necessary and this threshold may vary from 
speaker to speaker or between certain discourse 
situations. Furthermore the analysis fails in those 
cases where the correct position is rated lower 
than this value, i.e. where the parser does not 
consider the correct trace position at all. Thus, in 
a second experiment we examined how the syntac- 
tically correct verb trace position is ranked among 
the positions proposed by the prosody module 
w.r.t. its S3-boundary probability. If the cor- 
rect position turns out to be consistently ranked 
among the positions with the highest S3 probabil- 
ity within a sentence then it might be preferable 
for the parsing module to consider the S3 posi- 
tions in descending order rather than to introduce 
traces for all positions ranked above a threshold. 
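
The alternative strategy, visiting candidate positions in descending order of S3 probability, can be sketched as follows. The function names and the example values are hypothetical; how the parser would consume this ordering is not specified here.

```python
from typing import List, Sequence


def positions_by_s3_probability(boundary_probs: Sequence[float]) -> List[int]:
    """Word-boundary indices ordered from highest to lowest S3 probability."""
    return sorted(range(len(boundary_probs)),
                  key=lambda k: -boundary_probs[k])


def rank_of_position(position: int, boundary_probs: Sequence[float]) -> int:
    """1-based rank of a given position in the descending ordering."""
    return positions_by_s3_probability(boundary_probs).index(position) + 1


# Hypothetical probabilities; the sentence-final boundary (index 4) scores highest.
example_probs = [0.002, 0.004, 0.03, 0.001, 0.95]
order = positions_by_s3_probability(example_probs)
```

Unlike the threshold test, this ordering never discards the correct position outright; it only delays positions with low scores.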

For the second experiment we considered only 
those segments in the input that represent V2 
clauses, i.e. we assumed that the input has been 
segmented correctly. Within these sentences we 
ranked all the spaces between words according to 
the associated S3 probability and determined the 
rank of the correct verb trace position. When per- 
forming this test on 134 sentences the following 
picture emerged: 



Rank         1    2    3    4    5    6    7   > 7
# of occ.   96   22    7    4    3    -    1    1

Table 5: Ranking of the syntactically correct verb trace position within a sentence according to the S3 probability.

² X0 position means that the relevant position is occupied by an X0 gap; X0 prop. means that the classifier proposes an X0 at this position.

Table 5 shows that in the majority of cases the 
position with the highest S3 probability turns out 
to be the correct one. It has to be added though, 
that in many cases the correct verb trace position 
is at the end of the sentence which is often very 
reliably marked with a prosodic phrase boundary, 
even if this sentence is uttered in a sequence to- 
gether with other phrases or sentences. This end- 
of-sentence marker will be assigned a higher S3 
probability in most cases, even if the correct verb 
trace position is located elsewhere. 
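
To make the distribution in Table 5 easier to read, the cumulative share of sentences covered by the top-ranked positions can be computed from the counts. This is a sketch under one assumption: the table shows no entry for rank 6, which is taken as 0 here; this is consistent with the counts summing to the 134 sentences tested.

```python
from itertools import accumulate
from typing import List

# Counts from Table 5 for ranks 1..7 and "> 7".
# Assumption: the missing entry for rank 6 is treated as 0.
RANK_COUNTS: List[int] = [96, 22, 7, 4, 3, 0, 1, 1]

TOTAL = sum(RANK_COUNTS)

# cumulative_share[i] is the fraction of sentences whose correct trace
# position is ranked at or above rank i + 1.
cumulative_share = [count / TOTAL for count in accumulate(RANK_COUNTS)]
```

Under these counts the correct position has rank 1 in 96 of 134 sentences (about 72%) and rank 1 or 2 in 118 of 134 (about 88%).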



In a third experiment, finally, we were interested in the overall speedup of the processing module that resulted from our approach. In order to estimate this, we parsed a corpus of 109 turns in two different settings: while in the first round the threshold value was set as described above, we selected a value of 0 for the second pass. The parser thus had to consider every position in the input as a potential head trace location just as if no prosodic information about syntactic boundaries were available at all. It turns out then (cf. table (6)) that employing prosodic information reduces the parser runtime for the corpus by about 46%.





            With Prosody    Without Prosody
Overall            704.8             1304.2
Average              6.5               11.9
Speedup           45.96%                ./.

Table 6: Comparison of runtimes (in secs) for parsing batch-jobs with and without the use of prosodic information, resp.
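
The speedup figure can be retraced from the overall runtimes in Table 6, assuming it is defined as the relative reduction in total runtime (this definition is inferred from the numbers and is not stated explicitly in the text; the symbols $T_{\text{with}}$ and $T_{\text{without}}$ are introduced here for illustration):

```latex
\[
\text{Speedup}
  = \frac{T_{\text{without}} - T_{\text{with}}}{T_{\text{without}}}
  = \frac{1304.2 - 704.8}{1304.2}
  = \frac{599.4}{1304.2}
  \approx 0.4596
\]
```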



6 Conclusion 

It has been shown that prosodic information can be employed in a speech processing system to determine possible locations of empty elements. Although the primary goal of the categorial labelling of prosodic phrase boundaries was to adjust the division of turns into sentences to the intuitions behind the grammar used, it turned out that the same classification can be used to minimize the number of wrong hypotheses pertaining to empty productions in the grammar.

We found a very useful correspondence between an observable physical phenomenon - the prosodic information associated with an utterance - and a theoretical construct of formal linguistics - the location of empty elements in the respective derivation. The method has been successfully implemented and is currently being refined by training the classifier on a much larger set of examples and by integrating categorial information about the relevant positions into the probability score for the various kinds of boundaries.